ABSTRACT


Proof of retrievability (POR) and proof of data possession (PDP) are cryptographic tools
for auditing large data sets held on a storage server or in the cloud. Their goals are to verify that
the server is storing the data and, if the data has been altered, to recover it. These tools
provide probabilistic guarantees that the server is storing the data without accessing
the entire file, and they provide the capability to recover the original data within certain limits.
In this work, we study maximum distance separable (MDS) codes as the underlying tools 
providing recoverability for POR. We survey MDS codes and select Reed-Solomon and 
Cauchy Reed-Solomon MDS codes to be implemented into a prototype POR library. We 
use the liberasurecode library to evaluate multiple error-correcting code (ECC) backend 
implementations for these codes. We enhance the libpdp library, an open source PDP 
library that implements some PDP schemes, to interface with liberasurecode to measure 
the real-world cost of integrating erasure coding in POR implementations. 
CHAPTER 1:
Introduction 


Both proof of data possession (PDP) and proof of retrievability (POR) are cryptographic 
protocols for auditing data stored in remote servers without accessing an entire file. PDP 
and POR are like a digital signature but are efficient for periodic audits of massive data sets, 
at the cost of weaker, probabilistic integrity guarantees. 

Auditing cloud storage is an important issue that many scholars have studied. While different 
approaches have been suggested, no open source solutions exist providing the features or 
ease-of-use to inspire wide adoption in auditing and recovering data held by cloud storage. 
Indeed, further study is warranted in investigating the practicality of POR toward the goal 
of an open source cloud storage audit solution. 

Few analyses have been performed comparing the cost of retrievability for POR schemes. 
Proposed POR schemes employ some error-correcting code (ECC) mechanism, which 
imposes an additional encoding cost, storage overhead, and decoding costs during data 
tagging, recovery and update. 

In this thesis, we implement and evaluate POR schemes supporting recovery based on 
maximum distance separable (MDS) codes. We explore the real-world costs of MDS codes, 
to understand consequential issues for making POR practical in the single-server model. 
We focus on encoding, storage and tagging costs. We defer exploring decoding and update 
cost to future work. 

Our work makes the following primary contributions: 

• We enhance an open source PDP library, libpdp, to provide retrieval guarantees.

• We explore the costs associated with POR from practical, rather than asymptotic,
perspectives.

• As expected, we find that the cost of tagging is proportional to the increase in file
size resulting from the use of the MDS code, and that it closely follows the performance
costs reported by Bremer [1].

• We find that parity data adds storage cost, with the ratio m/k determining the
overhead amount.


1.1 Motivation 

In today’s information age, as computing and communication technologies become more
powerful, people exchange huge amounts of data and demand
more storage capability for personal, business, and even military needs. The
International Data Corporation’s (IDC) sixth annual study of the digital universe [2] showed that
the digital universe is growing by 40 percent a year. The study also stated that the digital
universe held about 4.4 ZB of data in 2013 and is expected to reach 44 ZB by 2020. Furthermore,
users require access to their data even when they are away from their personal computers or
portable storage devices. Today’s users expect on-demand access to their data anywhere in
the world. The cloud is a suitable solution, providing the benefits of high storage capability,
greater accessibility, reliability, archival, and data backup. For instance, the same IDC
report states that, by 2020, almost 40% of the digital universe data will be “touched” by
cloud computing providers. Also, in 2012, DOD Chief Information Officer Teresa M. Takai
released the Department of Defense Cloud Computing Strategy in an effort to cope with the
massive growth of data and to reduce data storage costs.

Yet the cloud presents another issue related to data integrity. It is not certain that users can
rely entirely on the cloud server to manage their data, and it is impossible to be sure of
the server’s behavior. Consequently, the security of stored data is one of the big concerns
for organizations and individuals. For instance, a 2015 Vormetric data security study [3] 
stated that 60% of IT decision makers report keeping sensitive data locally, not in the cloud. 
Their concerns are related to lack of data control and increased vulnerabilities. 


1.2 Outline 

In Chapter 2, we provide background on proof of retrievability approaches and error-correcting
codes. In Chapter 3, we describe a specific type of ECC that is capable of
correcting adversarial erasures and survey codes satisfying this requirement. In Chapter 4, 
we discuss design criteria for using MDS codes in POR schemes. In Chapter 5, we discuss 
a prototype POR library and evaluate the performance related to the aspects introduced to 
support retrievability. In Chapter 6, we conclude and discuss future work. 
CHAPTER 2:
Background 


In this chapter, we review proof of retrievability (POR) schemes, the single server and 
distributed failure models, and the role of error-correcting codes (ECCs) in retrieving data. 

2.1 Proof of Retrievability 

POR enables a client to remotely audit stored data in a manner robust against adversarial 
erasures, at a cost substantially less than traditional digital signatures. This is achieved 
through an interactive challenge-response protocol implementing a random, sublinear audit 
of file integrity. Assuming some ε-fraction of audits is passed, POR schemes admit an
efficient mechanism to extract and recover the original file. A summary of major POR 
schemes and their properties is provided in Table 2.1. 
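
To give a feel for the probabilistic guarantee of a random, sublinear audit, the following sketch computes the chance that a single audit touches at least one damaged chunk. It assumes chunks are sampled uniformly without replacement; the function and parameter names are ours, chosen for illustration, and the calculation is a standard counting argument rather than the analysis of any particular scheme in Table 2.1.

```python
from math import comb


def detection_probability(n: int, damaged: int, sampled: int) -> float:
    """Probability that at least one damaged chunk is among the sampled ones.

    n       -- total number of chunks stored
    damaged -- number of chunks the server has altered or erased
    sampled -- number of chunks queried in one audit
    """
    if not (0 <= damaged <= n and 0 <= sampled <= n):
        raise ValueError("damaged and sampled must lie in [0, n]")
    # The audit misses only if every sampled chunk is drawn from the
    # n - damaged intact chunks.
    miss = comb(n - damaged, sampled) / comb(n, sampled)
    return 1.0 - miss
```

Under these assumptions, sampling a few hundred chunks out of ten thousand already detects a 1% corruption with high probability, which is why the audit cost can stay far below the file size.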

Juels and Kaliski [4] and Naor and Rothblum [5] proposed the first approaches to remotely 
audit cloud storage at sublinear cost. Their framework was based on splitting the original 
file into chunks, signing chunks using a cryptographic message authentication code (MAC) 
scheme to generate a set of data tags, and storing both the data chunks and tags remotely. The 
file could then be remotely audited using an interactive protocol, where a random subset of 
chunks and their corresponding tags could be retrieved and verified. This random sublinear 
audit could be repeated, arbitrarily increasing the client’s confidence in the integrity of the 
data without retrieving the entire file. 
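
The chunk-and-tag framework described above can be sketched in a few lines. This is a minimal illustration, not the construction of [4] or [5]: the chunk size, the use of HMAC-SHA256, and all function names are assumptions made for the example, and the "server" is simply a pair of in-memory lists.

```python
import hashlib
import hmac
import secrets

CHUNK_SIZE = 4096  # assumed chunk size in bytes


def tag_chunk(key: bytes, index: int, chunk: bytes) -> bytes:
    """MAC over (index || chunk), binding each tag to its position in the file."""
    message = index.to_bytes(8, "big") + chunk
    return hmac.new(key, message, hashlib.sha256).digest()


def tag_file(key: bytes, data: bytes):
    """Split data into chunks and compute one tag per chunk."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    tags = [tag_chunk(key, i, c) for i, c in enumerate(chunks)]
    return chunks, tags


def audit(key: bytes, chunks, tags, sample_size: int) -> bool:
    """Check a random subset of chunks against their stored tags."""
    rng = secrets.SystemRandom()
    count = min(sample_size, len(chunks))
    indices = rng.sample(range(len(chunks)), count)
    return all(
        hmac.compare_digest(tag_chunk(key, i, chunks[i]), tags[i])
        for i in indices
    )
```

Each call to `audit` retrieves only `sample_size` chunks and tags, so repeating it raises confidence in the stored data without transferring the whole file.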

A generalization of these early schemes was later shown to admit efficient mechanisms for 
file recovery, yielding a POR scheme [6]. Ateniese et al. [7] and Shacham and Waters [6] 
proposed schemes with lower communication cost, also shown to admit file recovery under 
appropriate pre-processing steps. These schemes are all in the single-server failure model. 
In this model, all data is held by a single remote cloud service and data loss occurs due to 
adversarial block-level erasures. File pre-processing uses an appropriate ECC to achieve 
recovery against adversarial erasure (see Chapter 3). 

Later schemes considered a different failure model, the multiple-server failure model. In 
this model, all data is distributed across multiple servers and data loss occurs due to 
adversarial file deletion at the server level. Three broad approaches exist to add redundancy
in distributed storage systems: file replication, network coding (NC) and error-correcting 
codes (ECCs). POR recovery mechanisms have been based on each of these approaches. 
Curtmola et al. [8] provide a POR scheme ensuring that multiple unique copies of the data 
are stored across different servers. Chen et al. [9] propose a POR scheme for network 
coding-based distributed storage. Bowers et al. [10] introduce a POR scheme, which 
permits a set of servers to prove to a client that data is stored and retrievable, employing 
error correction codes. 


Table 2.1: Comparison of existing POR schemes. 


Model 

Scheme 

Year 

Recovery 

Storage 

Communication 

approach 

complexity 

complexity 

Single 

Server 

MAC-PDP [4], [5] 
A-PDP [7] 

2007 

2007 

ECC 

ECC 

0(1*1 + n\cr\) 
0(1*1 /*) 

O(l) 

O(l) 

CPOR [6] 

2008 

ECC 

0(1*1/* + n\cr\) 

O(l) 

Multiple 

Server 

MR-PDP [8] 

2008 

replication 

0(1511*1) 

0(1*1) 

HAIL [10] 

2009 

ECC 

0(|S||*l/*) 

0(1*1) 

RDC-NC [9] 

2010 

NC 

0(2\S\\F\/(k + l)) 

0(2\F\/(k + 1)) 


See Section 2.3 for relevant notation. For multiple server schemes, |S| denotes
the number of servers employed; for HAIL, |S| = k + m where k
primary servers store the original fragments and m secondary servers store
parity fragments. Adapted from [9, Table 1]: B. Chen, R. Curtmola, G. Ateniese,
and R. Burns, “Remote data checking for network coding-based distributed
storage systems,” in Proceedings of the 2010 ACM Workshop on
Cloud Computing Security, 2010, pp. 31-42.


2.2 Error Correction Codes 

Error detection and error correction are mathematical procedures enabling reliable storage 
and transmission of data, providing a way to detect and correct errors caused by channel 
noise. The idea is to transform the data while incorporating redundancy, allowing the 
original data to be recovered even in the presence of some limited transmission errors (see 
Figure 2.1). Shannon [11] first proposed a mathematical model for data transmission across 
a channel in the presence of noise, effectively defining the field of information theory. In this 
model, Shannon proved a theoretic limit (the noisy-channel coding theorem) on the maximum
efficiency of any error correction strategy under some rate of stochastic/probabilistic
transmission errors. Hamming [12] provided the first concrete ECC to achieve this optimal 
rate with minimal distance between codewords, called a perfect code. Hamming’s code 
used a combinatorial argument to ensure unique decoding, rather than an argument based 
on the likelihood of correlated errors, thus allowing arbitrary error detection up to some 
threshold. 


[Figure labels: Original Data, Redundancy, Recovered Data.]

Figure 2.1: Error recovery using ECC.

Adapted from [13]: J. S. Plank and C. Huang, “Tutorial: Erasure coding for
storage applications,” Slides presented at FAST-2013: 11th Usenix Conference
on File and Storage Technologies, San Jose, February 2013.

Broadly, ECCs are divided into two categories: block codes and convolutional codes. Block
codes divide data into blocks, called datawords, and transform each dataword into a codeword.
In convolutional codes, redundancy depends not only on the current input bits but
also on some number of previous bits.

For ECCs, the two most common channel error models are independent errors and burst 
errors. Independent, or random, errors are those produced by a memoryless, stochastic 
noise process: codewords are impacted by noise according to some constant probability, 
with no correlation in time or location among errors. Burst errors are those where noise 
is localized in short intervals rather than at random. For example, the Gilbert-Elliott burst
error model uses a two-state Markov process to switch, with some probability, between 
“good” and “burst” states, incurring transmission errors in the burst state with some rate. 
In the real world, burst errors are intended to describe physical processes with correlated 
noise; for example, errors resulting from physical irregularities in storage media or structural
alteration are likely not independent but, rather, tend to be spatially concentrated. 
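
The two-state burst model described above can be simulated directly. The sketch below is illustrative only: the parameter names, the default error rates, and the choice to start in the good state are our assumptions, not values taken from the literature cited here.

```python
import random


def gilbert_elliott(n_symbols, p_good_to_burst, p_burst_to_good,
                    err_good=0.0, err_burst=0.5, seed=None):
    """Return a list of booleans, True where a symbol is hit by an error.

    The channel starts in the good state. For each symbol, an error occurs
    with the rate of the current state, then the state may switch.
    """
    rng = random.Random(seed)
    in_burst = False
    errors = []
    for _ in range(n_symbols):
        rate = err_burst if in_burst else err_good
        errors.append(rng.random() < rate)
        # Two-state Markov transition between "good" and "burst".
        if in_burst:
            if rng.random() < p_burst_to_good:
                in_burst = False
        elif rng.random() < p_good_to_burst:
            in_burst = True
    return errors
```

With a small `p_good_to_burst` and a moderate `p_burst_to_good`, the resulting error pattern shows long clean runs interrupted by clusters of errors, in contrast to the evenly scattered errors of the independent model.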

The type of error model expected is crucial to selecting a suitable error correction code. 
Codes that are good at correcting random errors may do a poor job at correcting burst errors, 
or vice-versa. Other error models have been proposed, including the adversarial error model 
(discussed further in Chapter 3) and the computationally bounded error model [14], [15]. 
Both consider more general failure scenarios, where deletions may be performed by an 
active adversary, selecting noise following some adversarial strategy rather than at random. 

2.3 Notation 

In this work, we consider the properties of POR schemes in the single-server failure model. 
For these, a client C transforms a file F into an encoded file F̃. The client stores this data
with some cryptographic metadata (tags) using some remote storage service S. Later, the
client or a third-party auditor will remotely verify the integrity of the file data through an
interactive protocol. Assuming some ε-fraction of audits succeed, it should be possible to
extract the original file F. In the remainder of this thesis, we employ the following notation
whenever describing these schemes: 

F    The original file, of size |F|.

k    An encoding parameter, used to divide F into k data blocks.

μ    The data block size, μ = ⌈|F|/k⌉.

w    The word size; each data block is considered to be a series of w-bit words.

m    An encoding parameter, used to extend the k data blocks with m parity blocks.

F̃    The encoded file, to be tagged and remotely stored.

σ    A single tag, of size |σ|.

n    An audit parameter, the total number of chunks in F̃.

A    An audit parameter, the number of chunks to be queried on each audit.

Some encoding schemes may divide F into L logical blocks of size ℓ, encoding each logical
block into k data blocks and m parity blocks. Some audit schemes may further divide 
audit chunks into sectors. The previously identified parameters will be the most useful and 
common notation for our discussion. 
CHAPTER 3:

Adversarial Erasure Coding 


In this chapter, we define and review adversarial erasure codes. We survey maximum 
distance separable codes, which are able to correct worst-case errors like those considered 
by Hamming [12] and are one category of adversarial erasure code. 


3.1 Definition 

Hamming’s adversarial channel [12] introduces noise in a worst-case manner, flipping or 
erasing bits arbitrarily according to some computationally unbounded strategy up to some 
error rate. Lipton [14] proposes a variant of this model in which the adversarial strategy 
is computationally bounded, modeled by an arbitrary polynomial-time Turing machine.
Shacham and Waters [6] give a construction for an adversarial erasure code able to correct 
worst-case errors placed following a computationally bounded strategy, using an efficient 
linear-time erasure code, scrambled and protected using encryption; this essentially causes 
any computationally bounded strategy to have no better success than random erasures. 
Bowers et al. provide a related construction for a systematic, adversarial error correcting 
code. 


3.2 Maximum Distance Separable Codes 

The Singleton bound describes the possible relationship between codeword length and the
distance between codewords. For a linear code with codewords of length c, messages of length M, and
minimum distance d, the bound is typically expressed as


d ≤ c − M + 1.


MDS codes are linear codes in which the minimum distance between any two codewords
meets the Singleton bound with equality [16]. These have been proposed for fault
tolerance in many applications, including cloud storage. For example, Bowers et al. propose
maximum distance separable codes to ensure retrievability of data stored in distributed
storage [10].


MDS codes divide a file F into k blocks of size |F|/k. These k blocks are used to calculate
m parity fragments. The k blocks and m parity blocks are concatenated together to form
n = k + m data blocks, which can be considered an array of w-bit codewords. If the encoded file
is altered, the original file can be recovered from any k of the n encoded fragments. The
parameters k and m depend on the type of code and user requirements.
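
The "any k of n" recovery property can be demonstrated with a toy Reed-Solomon-style code. The sketch below is ours and makes simplifying assumptions: it works over the prime field of integers modulo 257 rather than the GF(2^w) fields used by production libraries, it treats each symbol as a whole block, and it uses plain Lagrange interpolation with no performance optimizations.

```python
from itertools import combinations

P = 257  # prime modulus; symbols are integers in [0, 256], and k + m must not exceed P


def _interpolate_at(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, m):
    """Systematic encoding: the k data symbols followed by m parity symbols."""
    k = len(data)
    points = list(enumerate(data))
    return list(data) + [_interpolate_at(points, x) for x in range(k, k + m)]


def recover(survivors, k):
    """Rebuild the k data symbols from a dict {position: symbol} with at least k entries."""
    points = list(survivors.items())[:k]
    return [_interpolate_at(points, x) for x in range(k)]


def all_k_subsets_recover(data, m):
    """Check that every choice of k surviving positions yields the original data."""
    k = len(data)
    codeword = encode(data, m)
    return all(
        recover({pos: codeword[pos] for pos in kept}, k) == list(data)
        for kept in combinations(range(k + m), k)
    )
```

Because the k data symbols and m parity symbols are all evaluations of one polynomial of degree less than k, any k of them determine it, which is the MDS property in this toy setting.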

Shacham and Waters observe that erasure codes derived from Reed-Solomon codes have 
the property that they can decode in the presence of adversarial erasures [6, §1.1]. The 
property desired is exactly that of MDS codes, which are able to correct up to m adversarial erasures.
These codes are traditionally criticized for the reasons that (i) they are inefficient and (ii) 
their goals are unrealistically strong, i.e., correcting noise placed by an adversary that is 
computationally unbounded. More efficient codes are possible and discussed by Shacham 
and Waters [6] and Bowers et al. [17]. We evaluate MDS codes as a first step toward the 
future implementation of libraries supporting more efficient adversarial erasure codes. 
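
As a short worked step, the claim that an MDS code corrects up to m erasures follows from the Singleton bound. We assume here, as in Section 2.3, that k data blocks are extended by m parity blocks, so a codeword spans k + m blocks and carries k blocks of message; meeting the bound with equality then gives:

```latex
d = (k + m) - k + 1 = m + 1
```

A code with minimum distance d can correct up to d − 1 erasures, which here is m. For example, with k = 10 and m = 4, the minimum distance is 5 and any 4 of the 14 blocks may be lost.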


3.3 MDS Survey 

In this section, we survey MDS codes, comparing those parameters that appear consequential 
to storage and retrievability for POR schemes. Different ECCs have different structures and 
aim to satisfy specific goals: some seek to increase recovery efficiency, others focus on 
minimizing recovery costs, etc. We focus on the following properties: 

Coding chunks. A bound on m, the number of blocks reserved for parity. 

Storage overhead. The difference in the amount of stored data between an original file and 
the encoded file. 

Fault tolerance. The maximum number of blocks that may be lost while allowing recovery 
of the original file. 

Cost of retrievability. The number of fragments needed to recover from one failure. 

Results are summarized in Table 3.1. For details on the MDS codes summarized here, we 
refer the reader to the existing survey work of Plank et al. [18] and Schnjakin et al. [19]. 
EVENODD and Minimal Density codes limit the number of coding chunks, which
in turn limits their fault tolerance. Reed-Solomon (RS), Cauchy Reed-Solomon (CRS),
and Rotated RS are more flexible, placing no bounds on the number of coding chunks that 
may be added. 


Table 3.1: Comparison of MDS code properties. 


Code 

Coding 

Chunks 

Storage 

Overhead 

Fault 

Tolerance 

Cost of 
Retrievability 

RS [20] 

oo 

n/k 

n — k 

n— 

w 

CRS [21] 

oo 

n/k 

n — k 

<11 — 
w 

EVENODD [22] 

2 

1+2 Ik 

2 


Minimal Density [23] 

2 

1+2 /k 

2 

< 11 — 
w 

Rotated RS [24] 

OO 

n/k 

n — k 

L(„ + !1\E 

2V* + k> w 


Adapted from [25]: M. Schnjakin, T. Metzke, and C. Meinel, “Applying
erasure codes for fault tolerance in cloud-raid,” in Proceedings of the IEEE
16th International Conference on Computational Science and Engineering 
(CSE), 2010, pp. 66-75. 

For recovery, RS codes are considered expensive due to the multiplication operations performed during
decoding. While their asymptotic costs appear similar, CRS codes are considered to provide
more acceptable recovery cost. They reduce the number of operations by using Cauchy
matrices (which have fewer ones than the Vandermonde matrices used by RS codes) and XOR
operations (rather than multiplication). Rotated RS codes transform chunks of w-bit
words into r groups of bit-words. These have improved recovery performance since they are
designed to provide better input/output efficiency: accessing data at the bit level requires
less data to be read during recovery.
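
To make the contrast between field multiplication and XOR concrete, the following sketch implements arithmetic in GF(2^8). The reduction polynomial 0x11D (x^8 + x^4 + x^3 + x^2 + 1) is an assumption chosen for illustration, and the shift-and-reduce loop is a textbook method, not the optimized table-driven or hardware-accelerated routines used by the libraries surveyed below.

```python
PRIM_POLY = 0x11D  # assumed reduction polynomial: x^8 + x^4 + x^3 + x^2 + 1


def gf_add(a: int, b: int) -> int:
    """Addition in GF(2^8) is a single XOR."""
    return a ^ b


def gf_mul(a: int, b: int) -> int:
    """Multiplication in GF(2^8) by shift-and-reduce (up to eight rounds)."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # add the current multiple of a
        a <<= 1              # multiply a by x
        if a & 0x100:
            a ^= PRIM_POLY   # reduce modulo the field polynomial
        b >>= 1
    return result
```

Addition costs one machine instruction per word, while each multiplication needs a loop or a table lookup, which is the cost that XOR-only encoders aim to avoid.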

In recent years, several open-source implementations of MDS codes have been made available.
In Table 3.2, we update earlier survey work of Plank et al. [18] and Schnjakin et al. [19],
excising libraries that are no longer available and including new libraries that have been
developed. Most libraries are written in C/C++, the exception being JavaReedSolomon.
Intel’s ISA-L library provides raw interfaces to take advantage of hardware acceleration
for operations used in erasure coding, as well as providing direct implementation of some
codes [26]. The library liberasurecode supports multiple ECC backends, including Jerasure
and ISA-L. Almost all libraries support RS, CRS or both, each providing different
implementations and optimizations.
Table 3.2: Survey of MDS support in open source ECC libraries.


Library                 RS   CRS   EVENODD   Minimal Density   Rotated RS
Intel ISA-L [26]        y    y     -         -                 -
JavaReedSolomon [27]    y    -     -         -                 -
Jerasure [28]           y    y     -         y                 -
Kodo [29]               y    -     -         -                 -
liberasurecode [30]     y    y     -         y                 -
Luby [21]               -    y     -         -                 -
OpenFEC [31]            y    -     -         -                 -
PyECLib [32]            y    y     -         y                 -
Zfec [33]               y    -     -         -                 -


Adapted from [25]: M. Schnjakin, T. Metzke, and C. Meinel, “Applying 
erasure codes for fault tolerance in cloud-raid,” in Proceedings of the IEEE 
16th International Conference on Computational Science and Engineering 
(CSE), 2010, pp. 66-75. 


Many of these implementations are used in practical storage solutions or made accessible 
through other libraries. The library PyECLib is a Python wrapper for liberasurecode [32],
developed for use in Swift [34]. The library librain is a wrapper for Jerasure used by
the OpenIO software defined storage project [35]. As of Hadoop Distributed File System (HDFS)
3.0, the HDFS project has incorporated its own Java implementations of XOR and RS
codes, taking advantage of the ISA-L library for hardware acceleration [36], [37]. The goal
of incorporating ECC with HDFS had been previously explored by Facebook’s HDFS-RAID
project [38].

Plank et al. [39] provide a practice-oriented performance analysis of many available open-source
ECC libraries. Their evaluation shows that Zfec implements the fastest classic
RS coding, with Jerasure showing the best implementation among the other schemes
evaluated. They highlight some parameters that impact performance in each library. Almost
all implementations evaluated are sensitive to the codeword size, w. In particular, w ∈
{8, 12, 13, 14, 16, 19, 24, 26, 27, 30, 32} yields poor performance for CRS, since the primitive
polynomials for these fields have one more bit set than in other cases. Plank et al. also reflect
on the role memory and cache footprints play in implementation efficiency; in particular,
they attribute the better performance of Zfec to its smaller memory footprint.
The EVENODD code is not an adequate solution for practical adversarial error correction
since it limits the number of errors that can be corrected, and there are no publicly
available implementations. Following the same rationale, Minimal Density will not satisfy
our needs. Schnjakin et al. conclude that Rotated RS is not beneficial in cloud storage 
applications because most cloud services do not support partial reads on data objects, 
nullifying its intended efficiency optimizations. Hence, we select RS and CRS codes for 
our evaluation, using liberasurecode to evaluate multiple backend ECC implementations of 
these. 
CHAPTER 4:
Design 


In this chapter, we provide an overview of the requirements and design for a proof of 
retrievability library supporting recovery. We review strategies for managing metadata and 
parity data for the library. 

4.1 Interfaces and Requirements 

Our POR library should implement all the interfaces required by applications seeking
to implement POR challenge/response protocols. Following the POR model described by
Shacham and Waters [6], we define a proof of retrievability scheme in the symmetric setting
using the following interfaces:

- POR_KeyGen(1^λ), the function used to generate the keys K employed by the scheme.

- POR_Encode(K, F), the function encoding the original file F using the adversarial
erasure code, to generate the encoded file F̃.

- POR_Tag(K, F̃), the function generating the metadata η and tags σ for F̃.

- POR_Challenge(K, η), the function generating a challenge c to be sent to the prover.

- POR_Proof(c, σ, F̃), the function generating the proof y as response to challenge c.

- POR_Verify(K, c, y), the function verifying the correctness of the proof y.

- POR_Recover(K, F̃'), the function that attempts to recover the file F from some,
possibly incomplete, version F̃' of F̃.
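
One way to picture these interfaces is as an abstract base class that each concrete scheme would implement. The sketch below is purely illustrative: the class name, method names, and argument types are our assumptions and do not correspond to the C interfaces of any existing library.

```python
from abc import ABC, abstractmethod


class PORScheme(ABC):
    """Hypothetical Python rendering of the seven POR interfaces listed above."""

    @abstractmethod
    def key_gen(self, security_bits: int):
        """Generate the keys used by the scheme."""

    @abstractmethod
    def encode(self, key, file_bytes: bytes) -> bytes:
        """Apply the adversarial erasure code to the original file."""

    @abstractmethod
    def tag(self, key, encoded: bytes):
        """Return (metadata, tags) for the encoded file."""

    @abstractmethod
    def challenge(self, key, metadata):
        """Produce a challenge to send to the prover."""

    @abstractmethod
    def proof(self, challenge, tags, encoded: bytes):
        """Prover side: answer a challenge."""

    @abstractmethod
    def verify(self, key, challenge, proof) -> bool:
        """Verifier side: check the prover's answer."""

    @abstractmethod
    def recover(self, key, damaged: bytes) -> bytes:
        """Attempt to rebuild the original file from a damaged encoding."""
```

Keeping `encode` and `recover` separate from the challenge/response methods mirrors the modularity requirement: an erasure-code backend can be swapped without touching the audit logic.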

Furthermore, our POR library design must be flexible enough to support a variety of 
symmetric-key and public-key POR schemes. The library should support different erasure 
codes. Also, it should be modular and support multiple error-correcting code backend 
libraries, i.e., different implementations of those codes. 

4.2 Data and Memory Utilization 

The file F may be arbitrarily large and, thus, may not fit in working memory during the 
encode and tag processes. Prior performance studies demonstrate that care must be taken 
when utilizing ECC and selecting parameters such as chunk size and code word size. Thus, 
our design sub-divides F into L logical blocks of size ℓ, processed independently. Each
logical block will be considered as k data blocks and expanded with m parity blocks. How to
organize these blocks without further, and unnecessarily, increasing the remote data storage 
requires discussion. Abstractly, we consider three general approaches to managing these 
chunks: 

1. Store the data blocks and parity blocks in separate, logical, remote files (see
Figure 4.1).

2. Interleave data and parity blocks in a single, logical, remote file (see Figure 4.2). 

3. Store the data blocks contiguously, appending all parity blocks, to form a single, 
logical, remote file (see Figure 4.3). 


[Figure labels: Original data, data, parity.]

Figure 4.1: ECC block storage design (Option 1).


Option 1 provides the user with “natural” access to the original file when there are no 
failures, as the data file can be made identical to F. This removes the need to decode on 
every access. In the presence of errors, recovery can be achieved by reading incrementally 
from both files. To encode, we read ℓ bytes from F and encode this as k data blocks and m
parity blocks, each of size ℓ/k; we write each block, sequentially, to its respective file. To
decode, we read ℓ bytes from the data file and mℓ/k bytes from the parity file. We
repeat the procedure until every logical block is recovered, combining these to recover F.
[Figure labels: Original data, data, parity, Original data block, Parity data block.]

Figure 4.2: ECC block storage design (Option 2).


Option 2 is similar to the first option procedurally, but data and parity blocks are stored 
interleaved as a single logical file. This storage makes regular access inconvenient because 
it does not allow “natural” access to the original file data F. 
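
The bookkeeping for the interleaved layout can be sketched as simple offset arithmetic. The sketch assumes that each logical block is stored as its k data fragments followed immediately by its m parity fragments, and that the logical block size is divisible by k; both are our assumptions for illustration, as are the function names.

```python
def layout(block_index: int, k: int, m: int, logical_size: int):
    """Byte ranges (start, end) of the data and parity of one logical block
    inside the single interleaved remote file."""
    if logical_size % k:
        raise ValueError("logical_size must be divisible by k")
    frag = logical_size // k          # size of one data or parity fragment
    stride = (k + m) * frag           # stored bytes per logical block
    start = block_index * stride
    return {
        "data": (start, start + k * frag),
        "parity": (start + k * frag, start + stride),
    }


def stored_offset(file_offset: int, k: int, m: int, logical_size: int) -> int:
    """Map a byte offset in the original file to its offset in the stored file."""
    if logical_size % k:
        raise ValueError("logical_size must be divisible by k")
    stride = (k + m) * (logical_size // k)
    block, within = divmod(file_offset, logical_size)
    return block * stride + within
```

Under these assumptions, reading the original file requires skipping m fragments after every logical block, which is the inconvenience noted above; the stored size grows by the factor (k + m)/k.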


[Figure labels: Original data, data, parity.]

Figure 4.3: ECC block storage design (Option 3).
Option 3 stores data and parity in a single file, but stores all parity data at the end. This allows
“regular” access to the file data by reading the first |F| bytes of the encoded file.
Procedurally, however, encoding requires more memory or intermediate storage to
hold parity blocks and write these at the end. Also, decoding requires scanning the encoded 
file to retrieve those parity blocks associated with the data blocks to recover each logical 
block. 

Option 1 is most convenient for client access but inconvenient for audit, since most POR 
literature describes how to audit a single file, rather than two files that must be treated as 
a single, logical file. Comparatively, Option 2 and Option 3 yield simpler audit logic. Of 
these options, Option 2 has simpler encode and decode logic. We select Option 2 for our 
design, while noting that Options 1 and 3 have preferable properties with respect to file 
access in the absence of errors. We defer exploring these options more completely to future 
work. 
CHAPTER 5:

Implementation and Analysis 


In this chapter, we describe our proof of retrievability implementation employing erasure 
coding. Then, we discuss and analyze the performance of this implementation for those 
metrics of relevance to real-world cost. 


5.1 Implementation 

Our implementation is an extension of the open source library libpdp, a library supporting 
proof of possession written in C that currently does not include the capability to restore 
data under partial erasure [40]. As discussed in Chapter 3, there are many open source 
libraries implementing erasure codes. Our primary solution is to extend libpdp to support 
recoverability by employing an appropriate erasure code library. The current libpdp library 
provides the following five interfaces: 

• pdp_key_gen: generates keys needed by the scheme;

• pdp_tags_gen: generates the tag data related to the file to be stored in the server;

• pdp_challenge_gen: generates the challenge to be sent to the Prover;

• pdp_proof_gen: generates the proof to answer the challenge;

• pdp_proof_verify: verifies the given proof.

The libpdp library currently supports the following four PDP schemes: 

• MACPDP: the simple MAC-based PDP scheme; 

• APDP: the Ateniese et al. PDP scheme [7]; 

• CPOR: the Shacham and Waters POR scheme, implemented as a PDP scheme [6]; 

• SEPDP: the Ateniese et al. PDP scheme [41]. 

We extend libpdp with new interfaces, relying on an existing backend erasure code library. 
We adopt the liberasurecode library, which not only implements some erasure codes but 
also interfaces with existing ECC libraries. The liberasurecode library consists of three 
main interfaces: 
• liberasurecode_encode: generates the erasure encoded data;

• liberasurecode_decode: reconstructs the original data from a set of at least k
encoded fragments;

• liberasurecode_reconstruct_fragment: reconstructs a missing fragment from a
subset of available fragments.

The function liberasurecode_encode transforms a logical block into k data fragments
and m parity fragments. Figure 5.1 summarizes this behavior. As part of encoding, this
library function adds an 80-byte header to each encoded fragment, resulting in 80(k + m) bytes of
additional information per logical block. The fragments will be serialized and stored as previously described
(see Chapter 4, Option 2). We reduce the storage overhead associated with encoding data 
by removing these headers during storage and reconstructing them before decoding. 
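
The header handling can be sketched as follows. This is a simplified illustration rather than our library code: it treats the 80-byte header as an opaque prefix, and `make_header` is a hypothetical callback standing in for whatever logic rebuilds a valid header; the actual header layout is not modeled.

```python
HEADER_SIZE = 80  # bytes of header prepended to each encoded fragment


def strip_headers(fragments):
    """Drop the fixed-size header from each fragment before storage."""
    payloads = []
    for fragment in fragments:
        if len(fragment) < HEADER_SIZE:
            raise ValueError("fragment shorter than its header")
        payloads.append(fragment[HEADER_SIZE:])
    return payloads


def restore_headers(payloads, make_header):
    """Re-attach headers before decoding.

    make_header(index, payload) is a hypothetical callback that must return
    exactly HEADER_SIZE bytes for the fragment at the given index.
    """
    fragments = []
    for index, payload in enumerate(payloads):
        header = make_header(index, payload)
        if len(header) != HEADER_SIZE:
            raise ValueError("rebuilt header has the wrong size")
        fragments.append(header + payload)
    return fragments
```

For one logical block with k + m fragments, stripping saves 80(k + m) stored bytes, at the cost of recomputing the headers on every decode.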
Figure 5.1: Encoding with liberasurecode.


We introduce two new interfaces for the libpdp library: 

• pdp_preprocess: encodes the original file;

• pdp_recover: reconstructs the original file from k blocks of the encoded file.

5.2 Experiment Design

Our goal is to provide a real-world cost analysis of using erasure codes to implement POR 
schemes. Bremer provides a cost analysis of the PDP schemes available in libpdp [1]. We 
analyze our POR implementation using similar metrics and methodology to compare with 
Bremer’s prior study. 

Plank et al. provide a performance evaluation of select MDS code implementations [18].
In part, their goal is to determine the (k, m, w) parameters that ensure optimal storage
overhead, best fault tolerance, and fastest encoding/decoding rate. While we aim for many
of the same goals, we re-evaluate their results to establish their applicability to our setting.
In particular, their primary application is redundant array of independent disks (RAID) storage,
and our cloud storage setting is quite different. Also, the libraries we explore, called via the
liberasurecode wrapper library, are different from and more recent than those in this earlier study.
Otherwise, we attempt to follow a comparable procedure to that of Plank et al., to determine
good ECC parameters to be employed by our POR library. We study four MDS code
implementations:

- Reed-Solomon implemented by Jerasure, denoted JRS. 

- Cauchy Reed-Solomon implemented by Jerasure, denoted JRC. 

- Reed-Solomon implemented by liberasurecode, denoted LER. 

- Reed-Solomon implemented by ISA-Intel, denoted ISA. 

We explore these using three (k, m) combinations and varying the value of w from 4 to 32,
following the parameter space explored by Plank et al.

Finally, we test the performance of the available POR schemes with file pre-processing
using an MDS code with (k, m, w) parameters that appear to perform well. This allows us
to interpret the real-world cost of POR schemes in the context of Bremer’s prior work [1].


5.2.1 Experimental Environment 

Our test environment is a Dell laptop running Windows 7 Enterprise with a 64-bit 2.7GHz
Intel Core i7 and 8GB of RAM. The host runs VMware version 7.0.0. The virtual
machine is 64-bit Ubuntu 14.04-LTS with 3GB of memory assigned. This environment
has more memory and faster processing capabilities than the test environment of Plank et al.
These capabilities allow us to run the virtual machine easily and to get reasonable results
in an acceptable amount of time.

5.2.2 Cloud Environment 

As POR is designed to audit cloud storage, the most natural environment in which to evaluate
our library is a cloud environment. Our initial experimentation used a virtual machine
running on Amazon Web Services (AWS): an Amazon c3.xlarge instance running 64-bit
Ubuntu 14.04-LTS with 8GB memory and using Hardware-assisted Virtual Machine
(HVM) virtualization. These preliminary tests revealed strange I/O behavior (see Figure
5.2). After a short period of time, all I/O operations suffer a dramatic loss of performance.
This behavior appears in some form regardless of the order of the tests, the file size, or
the parameters. We did not observe this behavior in our local virtual test environment.
We defer further study of this I/O behavior on Amazon instances to future work.

(a) Word size vs. preprocessing time (b) Test order vs. preprocessing time

Figure 5.2: Preprocessing performance for k = 6 and m = 2 on Amazon
c3.xlarge, showing performance sensitivity related to test order.


5.3 Analysis 

As previously outlined, our first set of experiments is intended to reproduce the performance
evaluation of open source erasure code libraries by Plank et al. [18]. Then, we test the
preprocessing time to characterize the encoding speed as a function of file size. Next, we
study the tag cost, which reveals the impact of introducing erasure coding to four POR
schemes. Finally, we measure the storage overhead resulting from the addition of the erasure
code and compare it to the prior study by Bremer [1].


5.3.1 Encoding Performance 

We investigate three (k, m) combinations, (6,2), (12,4) and (14,2), each time employing
a randomly generated 1 GB file. We vary word size w from 4 to 32 for all schemes; however,
per its documentation, JRS only meaningfully handles values w = 8, 16, 32. Each trial is
repeated five times. We investigate using two logical block sizes: 100 KB (as explored by
Plank et al.), and 64 MB (a common maximum chunk size for some distributed cloud file
systems). We instrumented the liberasurecode_encode function to perform measurement,
avoiding calls like read and write that incur I/O. This was an attempt to measure encoding
speed more closely following prior work of Plank et al., which excluded I/O costs.
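
The idea of timing only the encode call, with the input already in memory, can be sketched
as follows. This is an illustrative Python harness, not the instrumentation used in our
library: encoding_speed_mb_s and toy_encode are hypothetical names, and toy_encode is a
stand-in workload rather than a real erasure code.

```python
import os
import time
from typing import Callable


def encoding_speed_mb_s(encode: Callable[[bytes], object], data: bytes,
                        block_size: int, trials: int = 5) -> float:
    """Mean throughput in MB/s over several trials, timing only the encode calls."""
    # Split the in-memory buffer into logical blocks before any timing starts.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    total_mb = len(data) / (1024 * 1024)
    speeds = []
    for _ in range(trials):
        elapsed = 0.0
        for block in blocks:
            start = time.perf_counter()
            encode(block)  # only this call is timed; no read or write occurs
            elapsed += time.perf_counter() - start
        speeds.append(total_mb / max(elapsed, 1e-9))
    return sum(speeds) / len(speeds)


def toy_encode(block: bytes) -> bytes:
    """Stand-in workload (byte-wise complement); not a real erasure code."""
    return bytes(b ^ 0xFF for b in block)


data = os.urandom(1024 * 1024)  # 1 MB in-memory test buffer
speed = encoding_speed_mb_s(toy_encode, data, block_size=100 * 1024)
print(f"{speed:.1f} MB/s")
```

Accumulating the elapsed time per call keeps block slicing and any other bookkeeping out of
the measurement, which is the same separation the text aims for when excluding I/O costs.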



Figure 5.3: Word size vs. encoding speed for k = 6, m = 2, 100 KB logical block size.

Figures 5.3, 5.4 and 5.5 summarize the results of these experiments using 100 KB logical
block size. There are significant differences compared to the prior performance analysis of
Plank et al. The first is that our test environment demonstrates, overall, a much higher
encoding speed for all schemes evaluated. We expected, however, to see relatively comparable
performance patterns. For example, we expected JRC to show better performance than JRS
across all word sizes; however, we observed no such pattern. Also, compared to Plank et
al., we did not observe the same relationship between the word size and the CRS encoding
performance: no specific word sizes w stand out as performing worse. We do notice, however,
that CRS performance decreases as the word size increases. In fact, all schemes
present better encoding performance with small word sizes. For instance, RS implemented
by ISA-Intel presents the fastest encoding speed, and it performs better for w < 8.

Figure 5.4: Word size vs. encoding speed for k = 12, m = 4, 100 KB logical block size.

Figures 5.6, 5.7 and 5.8 summarize the results of these experiments using 64 MB logical
block size. They do not present any significant deviation from the behavior observed with
100 KB logical block size. Overall, we observe that encoding is faster using a larger
logical block size and a small word size. Furthermore, ISA-Intel delivers the fastest
encoding performance among the four tested schemes. We expected to observe that Cauchy
Reed-Solomon would be faster than Reed-Solomon for all experiments, but the experimental
results did not match this expectation. There are a number of possible explanations for the
differences between the observations of Plank et al. and our observations: our experimental
environment is different, our method of instrumenting measurement in the backend library
may have failed to exclude some I/O costs, the liberasurecode library may have introduced
some overhead that undermines the relative efficiency of the CRS backend implementation,
or our experiment environment may have introduced some limiting factor that constrained
performance in some unanticipated way. In summary, this preliminary performance
experimentation shows that the encoding schemes achieve better encoding speed when a
small word size w is adopted. Thus, for the following evaluation, we use the encoding
schemes with the default k and m values given by the developer of the liberasurecode library,
with word size w = 8 for JRS and w = 4 for the other schemes.

Figure 5.5: Word size vs. encoding speed for k = 14, m = 2, 100 KB logical block size.

Figure 5.6: Word size vs. encoding speed for k = 6, m = 2, 64 MB logical block size.

Figure 5.7: Word size vs. encoding speed for k = 12, m = 4, 64 MB logical block size.

Figure 5.8: Word size vs. encoding speed for k = 14, m = 2, 64 MB logical block size.


5.3.2 Pre-processing Performance 

Pre-processing is the procedure that encodes the file F and outputs the encoded file. This
operation includes file encoding and related I/O operations. Thus, the logical block size has
a direct effect on the total pre-processing time via the cost of read and write operations.
We test four different logical block sizes: 50 KB, 100 KB, 64 MB and 128 MB. We generate
random files of size 2 KB to 1 GB and we run our tests 10 times for each of the four schemes
evaluated (plotted results are averages). Figures 5.9 and 5.10 summarize the results of these
experiments.




(a) 50 KB logical block size (b) 100 KB logical block size

Figure 5.9: File size vs. preprocessing time for 50 KB and 100 KB logical block sizes.

(a) 64 MB logical block size (b) 128 MB logical block size

Figure 5.10: File size vs. preprocessing time for 64 MB and 128 MB logical block sizes.


As expected, for large file sizes, we see a nearly linear relationship between file size and
pre-processing time. For small file sizes, we observe surprisingly non-proportional
performance, which we believe is related to I/O costs or other activity interfering with our
measurement.


5.3.3 Tagging Performance 

The tagging performance is measured as the total time to generate tags associated with the
encoded file F. We generate random files of size 2 KB to 1 GB and we run our tests 10
times for each of the four schemes evaluated (plotted results are averages). In this test, we
used only RS implemented by Jerasure with the parameters k = 10 and m = 4 as the erasure
code. Figure 5.11 summarizes the performance of generating tags using the POR scheme
parameters described by Bremer, evaluated using our prototype library.

As expected, the use of erasure coding increases the time to tag the encoded file relative
to the time to tag the unencoded file. The trends we observe for each scheme are identical
to those previously reported by Bremer [1]. The tagging costs of the encoded file
are a simple linear shift of those costs for unencoded files.

The ratio m/k is the primary factor affecting this tag time. By increasing this ratio, we
increase the size of the file to tag, which will increase the tag time proportionally. For
the SEPDP scheme, the tag time increases linearly up to a point, after which the tag time
remains constant, when the file size exceeds the fixed number of tags to be generated; these
same trends for SEPDP are observed by Bremer.

Figure 5.11: File size vs. POR tag time.
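
The effect of m/k on the amount of data to tag can be sketched with a short calculation.
The Python snippet below is illustrative only: blocks_to_tag is a hypothetical helper, and
it assumes that one tag is generated per 4096-byte block (the block size used in Table 5.1),
that the encoded file is (k + m)/k times the size of the original, and that fragment padding
and headers are ignored.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for non-negative a and positive b."""
    return -(-a // b)


def blocks_to_tag(file_size: int, k: int, m: int, bs: int = 4096) -> int:
    """Number of bs-byte blocks to tag after (k, m) erasure encoding."""
    encoded_size = ceil_div(file_size * (k + m), k)  # |F| * (1 + m/k)
    return ceil_div(encoded_size, bs)


# A 1 GB file, without encoding and with k = 10, m = 4.
plain_blocks = blocks_to_tag(2**30, k=10, m=0)
encoded_blocks = blocks_to_tag(2**30, k=10, m=4)
ratio = encoded_blocks / plain_blocks
```

Under these assumptions the number of blocks to tag grows by a factor of about
1 + m/k = 1.4, which is consistent with tag time scaling in proportion to the encoded
file size for schemes that tag every block.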


5.3.4 Storage Overhead

The storage overhead for F is the sum of the parity and the tag data. We measure the
overhead for three of our POR schemes, excluding SEPDP. Table 5.1 shows the total storage
overhead expected based on prior analysis. Figure 5.12 shows the measured overheads
of our library, which closely match those expected. The results largely reflect
the addition of m parity blocks to the encoded file storage.

Table 5.1: File storage overhead for each POR scheme (bs = 4096 bytes).

Scheme     Tag size (bytes)   Tag file overhead (% |F|)   Total storage overhead with ECC (% |F|)
A-PDP      204                4.864%                      m/k + 4.864%(1 + m/k)
MAC-PDP    20                 0.477%                      m/k + 0.477%(1 + m/k)
CPOR       18                 0.429%                      m/k + 0.429%(1 + m/k)

Adapted from [1]: S. J. Bremer, “Cost comparison among provable data
possession schemes,” M.S. thesis, Dept. Comput. Sci., Naval Postgraduate
School (NPS), Monterey, 2016.
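
The formulas in the last column of Table 5.1 can be evaluated for concrete parameters. The
following Python sketch is illustrative: total_storage_overhead is a hypothetical helper,
the tag overhead fractions are copied from the table, and the choice of k = 10 and m = 4
is an assumption borrowed from the tagging experiment rather than a statement of the
parameters behind Figure 5.12.

```python
# Tag file overhead as a fraction of |F|, taken from Table 5.1.
TAG_OVERHEAD = {"A-PDP": 0.04864, "MAC-PDP": 0.00477, "CPOR": 0.00429}


def total_storage_overhead(scheme: str, k: int, m: int) -> float:
    """Total overhead as a fraction of |F|: m/k + t * (1 + m/k)."""
    parity = m / k  # parity data relative to the original file
    return parity + TAG_OVERHEAD[scheme] * (1 + parity)


# Evaluate every scheme for the assumed parameters k = 10, m = 4.
overheads = {name: total_storage_overhead(name, k=10, m=4) for name in TAG_OVERHEAD}
for name, value in overheads.items():
    print(f"{name}: {100 * value:.2f}% of |F|")
```

For these assumed parameters the parity term m/k = 40% dominates: the totals come to
roughly 46.8% for A-PDP, 40.7% for MAC-PDP and 40.6% for CPOR.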

CHAPTER 6:
Conclusion 


In this work, we analyzed maximum distance separable codes as adversarial erasure
codes for retrieving data stored in the cloud. We then developed and added a retrievability
interface to the PDP library, libpdp. Furthermore, we developed generic cost models for two
erasure codes implemented by three different backend libraries. We analyzed the effect of the
selection of parameters (k, m, w) on the performance of the erasure codes. Following that
analysis, we studied the cost impact of the use of erasure codes on tagging files. Our results
showed that using the erasure code increases the tag cost. Then, we analyzed the storage
overhead of three POR schemes. Our results showed that the integration of the erasure code
increases storage cost in proportion to m/k.

While our study covered only static retrievability of data saved on a single server, future
studies could incorporate the real-world cost of data updates on the performance of proof
of retrievability. Also, we studied only one option of memory utilization in our library.
Follow-on work could implement and study other options that could provide a more useful
solution. In Chapter 5, we observed that cloud-based experimentation with Amazon
instances introduces some unusual input/output (I/O) behavior. Future work could investigate
this issue, perhaps selecting an Amazon instance providing better I/O performance
for the purposes of stress-testing the encode and decode operations. It could also
investigate the anomalies we saw with respect to the relative speed of JRS and JRC, which
did not reflect prior observations of Plank et al.